ITC-Net-blend-60: a comprehensive dataset for robust network traffic classification in diverse environments

Objectives: Recognizing mobile applications within encrypted network traffic has considerable implications across multiple domains, including network administration, security, and digital marketing. Building network traffic classifiers that can adapt to dynamic and unpredictable real-world settings remains a major challenge. Currently available datasets contain only traffic captured in a single network environment, which limits their usefulness for evaluating the robustness and compatibility of a given model.

Data description: This dataset was gathered from 60 popular Android applications in five different network scenarios, with the intention of overcoming the limitations of previous datasets. The scenarios shared the same set of applications but differed in Internet service provider (ISP), geographic location, device, application version, and individual user. The traffic was generated through real human interactions on physical devices for 3–15 min. The capture method did not require root privileges on the mobile phones and filtered out background traffic. In total, the collected dataset comprises over 48 million packets, 450K bidirectional flows, and 36 GB of data.

Supplementary Information: The online version contains supplementary material available at 10.1186/s13104-024-06817-5.

As the volume of mobile app traffic continues to soar, the importance of reliable app identification solutions cannot be overstated. Mobile app traffic identification aids network management and security and provides valuable profiling information for advertisers, insurance companies, and security agencies [12–14]. As a result, it has garnered significant interest from both academia and industry, leading to extensive research on the subject. However, achieving robust application identification remains an open problem. Recent investigations [15, 16] have revealed that although existing classifiers achieve satisfactory performance when trained and tested in the conventional way (dividing a single dataset into training and test parts), most of them suffer significant performance degradation when evaluated on different datasets. This indicates a lack of robustness and compatibility of models in practical networks. The challenge stems from the unpredictable nature of real-world network environments [15, 16] and the dynamic and evolving behavior of mobile apps [13, 17–19]. A key requirement for building robust classifiers is a dataset of traffic captured in various network scenarios. Nonetheless, as indicated in S Table 3, existing datasets were mainly captured in a single invariant network environment.
In this paper, we address this limitation by presenting a dataset that was captured across five different network scenarios, with various factors affecting network traffic behavior. This dataset allows for the evaluation of model performance under different network conditions. It is provided in raw format (PCAP files) to give researchers the flexibility to develop models based on any traffic object (e.g., packet, flow, or bag of flows), feature, method, or innovative approach. Moreover, this dataset was generated through real human interactions on actual smartphones and captured using a non-rooted method. This makes the data more representative than synthetic data and better suited to mobile app traffic analysis.
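One way to use the five scenarios for robustness evaluation is a leave-one-scenario-out split, in which a model is trained on four scenarios and tested on the remaining one. The sketch below is illustrative only and is not a protocol prescribed by the dataset; the function name, the sample layout, and the toy scenario IDs and feature values are all assumptions.

```python
from collections import defaultdict


def leave_one_scenario_out(samples):
    """Yield (held_out_scenario, train, test) splits.

    `samples` is an iterable of (scenario_id, app_label, features) tuples.
    This layout is a hypothetical one chosen for illustration.
    """
    by_scenario = defaultdict(list)
    for scenario_id, app_label, features in samples:
        by_scenario[scenario_id].append((app_label, features))

    for held_out in sorted(by_scenario):
        test = by_scenario[held_out]
        # Training data is everything captured outside the held-out scenario.
        train = [
            item
            for scenario, items in by_scenario.items()
            if scenario != held_out
            for item in items
        ]
        yield held_out, train, test


# Toy data: scenario IDs, labels, and feature vectors are made up.
toy_samples = [
    ("A", "app1", [0.1, 0.2]),
    ("A", "app2", [0.3, 0.4]),
    ("B", "app1", [0.5, 0.6]),
    ("C", "app2", [0.7, 0.8]),
]

splits = list(leave_one_scenario_out(toy_samples))
```

Comparing the held-out scores with those of a conventional random split on a single scenario gives a rough measure of how much performance degrades across network environments.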

Data description
The methodology employed for collecting the dataset comprised three main stages: Application Selection, Traffic Capture Setup, and Traffic Generation. More details about each stage are provided in the accompanying supplementary materials.

Application selection
To collect traffic data, the first step is to determine which applications to monitor, given the vast number of applications available. We chose 60 Android applications from the top 300 free apps listed in the Google Play Store and in two major Iranian Android app markets, Cafe Bazaar [1] and Myket [2]. Our selection was based on two criteria: the apps must require internet connectivity to fulfill their core functions, and they must generate traffic through user interactions. These 60 apps belong to 16 distinct categories, which are listed in S Table A1.

Traffic capture setup
We used a smartphone and a laptop to capture the traffic. The laptop ran Windows 10 and had an internal dual-band network card. We installed Wireshark [3] on it and configured it to capture traffic through the "Local Area Connection" interface. The laptop was connected to the internet and shared its connection with the smartphone via a hotspot. This allowed Wireshark to capture the smartphone's network traffic. However, the traffic captured by Wireshark also contained significant background traffic. To isolate the target application's network traffic, we installed PCAPdroid [4] on the smartphone in non-root mode. We used Wireshark and PCAPdroid simultaneously to record the target application's traffic.
After collecting traffic data, we separated the target application traffic from the background traffic by comparing the IP addresses and ports captured by Wireshark and PCAPdroid. Any pairs in the Wireshark capture that did not match a PCAPdroid pair were identified as background traffic and removed.
We implemented this method in Python 3 using the Scapy library [5]. The code for this implementation is available in the Supplementary material.
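The filtering step can be sketched roughly as follows. This is not the authors' script (which is in the Supplementary material) but a minimal illustration under stated assumptions: a "pair" is taken to be an (IP address, port) endpoint, a Wireshark packet is kept if either of its endpoints also appears in the PCAPdroid capture, and only IPv4 TCP/UDP traffic is considered. The function names and the toy addresses are hypothetical.

```python
def endpoints_from_pcap(path):
    """Collect the (IP, port) endpoints seen in a PCAP file (IPv4 TCP/UDP only)."""
    # Imported lazily so the matching logic below can run without Scapy.
    from scapy.all import IP, TCP, UDP, rdpcap

    endpoints = set()
    for pkt in rdpcap(path):
        if not pkt.haslayer(IP):
            continue
        if pkt.haslayer(TCP):
            l4 = TCP
        elif pkt.haslayer(UDP):
            l4 = UDP
        else:
            continue
        endpoints.add((pkt[IP].src, pkt[l4].sport))
        endpoints.add((pkt[IP].dst, pkt[l4].dport))
    return endpoints


def split_background(wireshark_records, app_endpoints):
    """Split records into (kept, background).

    Each record is a (src_ip, src_port, dst_ip, dst_port) tuple. A record is
    kept when either endpoint was also observed by PCAPdroid (assumed rule).
    """
    kept, background = [], []
    for record in wireshark_records:
        src_ip, src_port, dst_ip, dst_port = record
        if (src_ip, src_port) in app_endpoints or (dst_ip, dst_port) in app_endpoints:
            kept.append(record)
        else:
            background.append(record)
    return kept, background


# Toy example with documentation-range addresses (made up for illustration).
app_endpoints = {("10.0.0.2", 50000), ("203.0.113.10", 443)}
wireshark_records = [
    ("192.168.137.2", 50000, "203.0.113.10", 443),  # app traffic, upstream
    ("203.0.113.10", 443, "192.168.137.2", 50000),  # app traffic, downstream
    ("192.168.137.2", 50001, "198.51.100.7", 5228),  # unrelated background flow
]
kept, background = split_background(wireshark_records, app_endpoints)
```

In practice, `app_endpoints` would come from calling `endpoints_from_pcap` on the PCAPdroid capture, and the Wireshark records would be derived from the laptop-side capture in the same way.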

Traffic generation
The dataset collection was conducted by five volunteers from among the ITC-LAB members over a period of six weeks, from October to December 2021. Each volunteer collected traffic in a different network scenario (see S Table 1). Before commencing the data collection process, they were informed about the objectives of the traffic capture and the public release of the data in PCAP format. They also received training on how to collect traffic. The volunteers were required to conduct at least three experiments for every application, with each experiment consisting of interacting with a single app on a specific smartphone for 3 to 15 min. They were instructed to use the application as they normally would.
The resulting dataset is organized into separate repositories for each scenario, with a dedicated compressed file for every application. Each compressed file contains the corresponding PCAP files, all of which follow a consistent naming convention: (Application Name)_(Scenario ID)_(#Trace)_Final.pcap.
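The naming convention makes it straightforward to recover labels from file names. The following is a minimal sketch, assuming that the scenario ID and trace number contain no underscores (the application name may); the function name and the example file names are hypothetical and not taken from the dataset.

```python
from pathlib import Path

SUFFIX = "_Final.pcap"


def parse_trace_name(filename):
    """Split '(Application Name)_(Scenario ID)_(#Trace)_Final.pcap' into fields."""
    name = Path(filename).name
    if not name.endswith(SUFFIX):
        raise ValueError(f"unexpected file name: {name}")
    stem = name[: -len(SUFFIX)]
    # Split from the right so underscores inside the application name survive.
    parts = stem.rsplit("_", 2)
    if len(parts) != 3:
        raise ValueError(f"unexpected file name: {name}")
    app, scenario, trace = parts
    return {"app": app, "scenario": scenario, "trace": trace}


# Hypothetical file names, used only to illustrate the parsing.
simple = parse_trace_name("Telegram_A_1_Final.pcap")
with_underscore = parse_trace_name("some/dir/My_App_B_3_Final.pcap")
```

The trace number is kept as a string because its exact format in the released files is not specified here.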
The entire dataset comprises 1,159 PCAP traces and 36 GB of network traffic data. S Table A2 provides further details about the dataset for each app (for more information, please refer to the detailed description provided in the Supplementary Materials).